Probe region expression estimation for RNA-seq data for improved 

microarray comparability 



en 

o 

(N 

u 

< 
in 



Karolis Uziela ^'^ and Antti Honkela ^ 

^ Helsinki Institute for Information Technology HUT, Department of Computer Science, 

University of Helsinki, Helsinki, Finland 

^ Science for Life Laboratory and Department of Biochemistry and Biophysics, 

Stockholm University, Solna, Sweden 






I 



> 

oo 

o 
m 



X 



Abstract 

Rapidly growing public gene expression databases 
contain a wealth of data for building an unprecedent- 
edly detailed picture of human biology and disease. 
This data conies from many diverse measurement 
platforms that make integrating it all difficult. In 
this paper, we propose a new method for processing 
RNA-sequencing data that yields gene expression es- 
timates that are much more similar to corresponding 
estimates from microarray data, hence greatly im- 
proving cross-platform comparability. The method 
we call PREBS is based on estimating the expression 
only from microarray probe regions, and processing 
these estimates with microarray summarisation algo- 
rithm RMA. This allows new ways of using RNA- 
sequencing data, such as expression estimation for 
microarray probe sets. Gene signatures defined based 
on PREBS expression measures of RNA-sequencing 
data are much more accurate for retrieval of similar 
microarray samples from a database. 



1 Introduction 

Public gene expression databases such as ArrayEx- 
press fll and Gene Expression Omnibus l2| host pub- 
lic data from more than half a million gene expres- 
sion experiments. While the field is moving toward 
sequencing-based methods for expression analysis, an 



overwhelming majority of the existing and even newly 
uploaded data are still from microarray platforms. 
The existing microarray-based data represent a huge 
investment and being able to utilise it efficiently 
as background information in new sequencing-based 
studies is of great interest. 

Recently there has been significant interest in util- 
ising the large public databases to holistically char- 
acterise phenotypes based on expression in new sam- 
ples [^. Most work utilising these large databases is 
based on differential expression |4}|6], but Schmid et 
al. [3] argue that absolute expression can yield a more 
comprehensive picture. All of these methods are cur- 
rently restricted to microarray data, which severely 
limits their utility in new studies. 

RNA-seq and microarrays are based on very dif- 
ferent principles and ultimately measure different 
things 7]. Numerous experimental comparisons have 
demonstrated RNA-seq and microarrays to yield 
broadly comparable results [8 ■ 14 . These results 
demonstrate that the platforms typically agree on dif- 
ferentially expressed genes between sufficiently differ- 
ent samples, although RNA-seq tends to be more sen- 
sitive. For measures of absolute expression, there is 
typically a clear correlation, the level of which ranges 
from moderate to very high depending on the exam- 
ple. 

In this paper we present a method for processing 
RNA-seq data in a way to make the resulting ex- 
pression measures significantly more comparable with 



measures derived from microarray data by estimat- 
ing the expression level at the microarray probe re- 
gions using a method we call PREBS (Probe Region 
Expression estimation Based on Sequencing). The 
improvement is especially significant for measures of 
absolute expression. This improved comparability 
comes at the expense of ignoring some information 
in the RNA-seq data by focusing the analysis to re- 
gions covered by the microarray probes. 

2 MATERIALS AND METH- 
ODS 

2.1 Basic description of the method 

One of the fundamental differences between microar- 
ray and RNA-seq technologies is that microarrays, es- 
pecially now ubiquitous oligonucleotide arrays, mea- 
sure gene expression based on the parts of the gene 
where probe sequences are located 
seq measures expression over the 
quence 



15] while RNA- 
whole gene se- 
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The idea of our method is to eliminate 
this difference by calculating RNA-seq gene expres- 
sion measures only based on the parts of the gene 
where microarray probe sequences are located. 

Traditionally gene expression is estimated from 
RNA-seq data by counting the number of reads 
that overlap with exons of the gene (count meth- 
ods) [l6p7 . The analysis in higher eukaryotes can be 
complicated by alternative splicing. To account for 
this, several methods have been proposed based on 
deconvolution of transcript isoform expression using 
probabilistic models 18-21 , but these methods still 



estimate the expression level across the whole gene. 
In PREBS method we estimate probe region ex- 
pression by counting the number of reads that over- 
lap with probe regions and using a statistical model 
to infer the expression level from the read counts. 
We treat the inferred probe region expression levels 
in a similar way as they are treated in computational 
microarray processing pipelines. In particular, we ap- 
ply the popular microarray data summarisation algo- 
rithm RMA ^2J commonly used for Affymetrix data 
analysis. The details of the RMA algorithm appli- 
cation and the statistical model used to infer probe 



regions will also be described in later sections. 

Using the described method we aim to computa- 
tionally process RNA-seq data in a way that is simi- 
lar to microarray computational processing pipelines. 
In the Results section we show that gene expression 
measures that we get from RNA-seq data this way are 
more similar to microarray measures than the mea- 
surements that we get using conventional RNA-seq 
processing methods. We call our RNA-seq process- 
ing method PREBS (Probe Region Expression esti- 
mation Based on Sequencing). 

2.2 Probe region expression estima- 
tion from RNA-seq 

Read sampling in sequencing is inherently a stochas- 
tic process. To account for the uncertainty this in- 
duces, we use statistical methods to infer the probe 
region expression level from read data. 

We assume that the number of reads from a region 
with a given expression level follows the Poisson dis- 
tribution. Placing a conjugate gamma prior on the 
expression level, we obtain an estimate of the expres- 
sion level as the mean of the posterior distribution. 
The hyperparameters of the prior are determined us- 
ing an empirical Bayesian approach by maximising 
the marginal likelihood of the full data. 

2.3 Expression summarisation by 
PREBS-RMA 

Affymetrix microarray probes are grouped into probe 
sets containing 15-20 perfect match / mismatch probe 
pairs. Perfect match probes are completely com- 
plementary to gene portion they are interrogating 
while mismatch probes have their middle nucleotide 
changed. Some algorithms like MAS5 123^ use ex- 
pression values from mismatch probes to account for 
non-specific binding while RMA completely ignores 
mismatch probe values and uses only the perfect 
match probes. In order to simplify comparison with 
gene-based RNA-seq analysis methods, we use cus- 
tom probe set mapping (custom CDF files) i24. 

We used a modified version of the rma function 
in the R/Bioconductor Affy package [25] to apply 



the RMA algorithm to probe region expression es- 
timates. Our most significant modification was to 
drop the background correction step. Unhke microar- 
rays, RNA-seq data contains very httle noise 17 , so 



this step is not needed for RNA-seq based probe ex- 
pression estimates. We also modified the rma func- 
tion to take custom microarray probe expression val- 
ues instead of taking raw GEL files and calculating 
probe expression by itself as it does by default. Since 
RMA algorithm normalises all samples at the same 
time [22| , custom microarray probe expression values 
have to be supplied to the function from all of the 
samples simultaneously. The two other major steps of 
RMA algorithm, normalisation and summarisation, 
were left unchanged and performed as they are im- 
plemented in Affy package. 

2.4 Tools used for implementation 

In order to evaluate the effectiveness of our method 
(PREBS) we compared it to representatives of two 
RNA-seq analysis methods: count-based [l6] ("Read 
counting") and isoform deconvolution ("MMSEQ"). 
We processed sequencing data using each of the meth- 
ods and evaluated their agreement with microarray 
data by calculating correlations of gene expression. 

For the PREBS method, reads were mapped 
by TopHat software version 1.4.1 
man genome version GRCh37.65. We 
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to Hu- 
considered 

only unique genomic alignments to annotated tran- 
scripts. Locations for probe regions were retrieved 
from Custom CDF file annotations (version 15.0.0 
ENSG) [24] and the read overlaps with these regions 
were calculated using GenomicRanges package from 
R/Bioconductor [27]. Probe region expressions were 
calculated as described above and fed to a modi- 
fied version of the rma function from R/Bioconductor 
Affy package. 

Read counting RPKM values were calculated using 
the same tools as in PREBS method, but read overlap 
counts were calculated for Ensembl genomic annota- 
tions that were downloaded using GenomicFeatures 
package. RPKM values were calculated using these 
counts and logj values were taken. 

For isoform deconvolution we used MMSEQ [20] 
(software version 0.9.18). Bowtie software (ver- 



sion 0.12.7) [28] was used to map the reads to the 
transcriptome, as recommended by MMSEQ man- 
ual. MMSEQ options were set to default and 
Bowtie options were set as recommended by MMSEQ 
(-a —best —strata -S -m 100 -X 400). Hu- 
man transcriptome version GRCh37.65 from Ensembl 
database was used. MMSEQ output values were con- 
verted from natural logarithm scale to log2 scale. 

Affymetrix microarray data were processed using 
Affy package and custom CDF files. Microarray ex- 
pression values were summarised using rma function 
from Affy package. In case of multiple replicates, the 
mean value was taken as an absolute expression es- 
timate for each state. RMA-summarised values were 
in log2 scale, so no logarithm base conversion was 
needed. 

PREBS has a unique feature to calculate expres- 
sion levels for the original microarray probe sets us- 
ing manufacturer's CDF files. Unfortunately, probe 
mapping locations are not provided for manufac- 
turer's CDF files, so probe sequences were mapped to 
the latest human genome build (hgl9) using Bowtie 
(version 0.12.7). 

Significance tests between the observed correla- 
tions differences were performed using r.test func- 
tion from psych package in R. 



3 RESULTS 

3.1 Data sets 

We evaluated the performance of our method on 
two data sets: Marioni et al. [8j and Acute myeloid 
leukaemia (LAML) from The Cancer Genome Atlas 
(TCGA) database. These will be referred to as Mar- 
ioni and LAML data sets, respectively. Both of these 
data sets have paired RNA-seq and microarray data. 
Marioni data set has two samples, human kidney and 
liver, both of which were used for testing. LAML data 
set has 183 samples for which both microarray and 
RNA-seq data is available, but we used only a subset 
of 17 samples for testing, because of the hard disk 
memory limitations (samples used: 2803, 2805, 2806, 
2807, 2810, 2812, 2814, 2816, 2817, 2818, 2819, 2820, 
2821, 2822, 2825, 2826, 2828). Samples that were 
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Figure 1: Absolute RNA-seq expression values computed using different methods plotted against microarray 
values. The figures show 60% of most highly expressed genes. First row shows results for the Kidney sample 
from the Marioni et al. data set and the second row for the 2803 sample from the LAML data set. The 
legend contains Pearson correlation (r) and the number of genes (n). 



used in LAML data set were selected based on numer- 
ical order, excluding the samples that had problems 
running the read mapping. In both of the data sets 
RNA-seq platform was lUumina Genome Analyzer II 
and microarray platform was Affymetrix U133 Plus 
2. 

3.2 Absolute expression comparison 

We processed RNA-seq data using three differ- 
ent methods: count-based, deconvolution-based and 
PREBS, and the microarray data using RMA. In 
most gene expression studies, low expressed genes are 



filtered out, because their measurements are noisy 
and unreliable. Common filtering thresholds for 
RNA-seq vary around 0.3 RPKM [29 . This fraction 
accounts for 70% of top expressed genes in Marioni 
data set and 60.9% of top expressed genes in LAML 
data set. To make the filtering uniform among all of 
the data sets and methods, we have decided to use 
60% of top expressed genes. 

Scatter plots of RNA-seq gene expression versus 
microarray gene expression and associated Pearson 
correlations for each of the three RNA-seq processing 
methods are displayed in Figure [I] All RNA-seq pro- 
cessing methods show a reasonable correlation with 
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Figure 2: Absolute expression correlations between different RNA-seq processing methods and the microarray 
for different numbers of top expressed genes. The correlations are averaged among all samples in the 
corresponding data sets, Marioni et al. data set (a) and LAML data set (b). 



the microarray. PREBS reaches the highest Pearson 
correlation both on Marioni (r — 0.78) and LAML 
(r = 0.83) data sets. The scatter plots are shown only 
for the first sample in each data set (kidney sample 
in Marioni and 2803 sample in LAML data set). The 
plots for the other samples were quite similar to the 
ones shown, so we do not present them here. 

Figure [2] shows average Pearson correlations plot- 
ted as a function of the fraction of most highly ex- 
pressed genes selected from the sample. The averages 
are computed over all samples in the data set. We 
can clearly see that PREBS has significantly higher 
correlation than the two other methods. The differ- 
ence is especially large for smaller numbers of top 
expressed genes on both of the data sets. 

We tested for significance of difference between 
PREBS vs microarray correlation and read count- 
ing or MMSEQ vs microarray correlation. All of the 
observed correlation differences were significant with 
p- values lower than 10~^. 



3.3 Differential expression compari- 
son 

Similarly to the absolute expression case, we com- 
pared the three RNA-seq processing methods based 
on agreement with microarrays in differential expres- 
sion measurements. For each of the methods and 
for microarrays we calculated log2 fold change values 
of gene expression between two states. For Marioni 
data set the logj fold change values were calculated 
between the two available samples: kidney and liver. 
For LAML data set we calculated log2 fold changes 
between all possible pairs of 17 samples, but in the 
scatter plots we present the results only for the first 
two samples (2803 and 2805). 

Figure |3] shows scatter plots similar to Figure [l] 
for genes that belong to 60% of top expressed genes 
in both conditions. In differential expression scatter 
plots, we can see that the agreement between mi- 
croarray and RNA-seq is higher than in the absolute 
expression case (Figure II]), and the differences be- 
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Figure 3: logj fold change values for differential expression estimated using different RNA-seq analysis 
methods plotted against corresponding microarray log2 fold change values. The figures show 60% of most 
highly expressed genes. Top row shows the fold changes between the kidney and liver samples from the 
Marioni data set are used for Marioni data set, while bottom row shows changes between samples 2803 and 
2805 from the LAML data set. 



tween different methods are in general small. 

The correlation plots in Figure |4] show the average 
correlation over all possible sample pairs (1 pair for 
Marioni data set and ( 2 ) = 136 pairs for LAML data 
set). Again, the performances of all of the methods 
are very similar. In both data sets PREBS performs 
slightly better on the higher end of expression (10- 
20%) , but slightly worse on the lower end of expres- 
sion (50-60%). 

In order to see whether the methods agree on which 
of the genes are significantly differentially expressed 
we have plotted Venn diagrams in Figure [5] We con- 



sidered genes that have absolute value of log2 fold 
change greater than 1.5 as significantly differentially 
expressed (the same criteria was used in 30 ). We 



chose this criteria, because it does not require sta- 
tistical tests, so the results are more comparable be- 
tween different methods. For the sake of simplicity, 
we did not include the read counting method into the 
diagrams. 

From the Venn diagrams we can see that PREBS 
has better correlation with microarray results by hav- 
ing less genes not detected by other methods and 
having many more genes in common with microar- 
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Figure 4: Average correlations of estimated log2 foM change values between (different RNA-seq processing 
methods and the microarray for different fractions of most highly expressed genes. The values are averaged 
among all pairs of samples in the corresponding data sets, Marioni in (a) and LAML in (b). 



rays in the LAML data set. MMSEQ finds more 
differentially expressed genes than either PREBS or 
microarray in both data sets. The added sensitiv- 
ity arises most likely because it uses read data from 
the whole gene regions, while PREBS restricts itself 
only to the gene regions where microarray probes are 
located. Overall, this again confirms that PREBS 
results agree with microarray better than MMSEQ 
results. 



3.4 Cross-platform 
pression 



differential ex- 



Better comparability between microarray and RNA- 
seq data also allows completely new operations, such 
as cross-platform differential expression analysis be- 
tween samples measured with different technologies. 
This is a very difficult task because RNA-seq and 
microarray measures suffer from different biases, and 
the results of any such analysis should always be in- 



terpreted with care. 

To compute the cross-platform differential expres- 
sion fold change we perform an extreme quantile 
normalisation by replacing RNA-seq gene expression 
measures by microarray gene expression measures 
having corresponding ranks in the coupled experi- 
ment. This way, we have not changed the relative 
order expression levels, but made the dynamic ranges 
of the two platforms identical. 

The correlation plots of logj fold changes for cross- 
platform differential gene expression are shown in 
Figure |6J We can see that PREBS significantly out- 
performs the two other methods both on Marioni and 
LAML data set and can reach a fairly reasonable level 
of correlation especially with the Marioni data. The 
relative performances of the different methods mirror 
those in Figure [2] because the performance depends 
mainly on similarity of absolute expression measures. 
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Figure 5: Venn diagrams illustrating the similarities of lists of genes called differentially expressed by dif- 
ferent methods. We call genes with the absolute value of logj fold change higher than 1.5 as significantly 
differentially expressed. 

3.5 Probe set expression calculation correlation with original probe set expression levels. 

The PREBS approach allows analysing RNA-seq 
data more like microarray data and actually com- 
puting expression values for microarray probe sets. 

Figure [7] shows the scatter plots for absolute 
and differential probe set expressions using PREBS 
method on Marioni data set. Since there is no easy 
way to calculate MMSEQ and read counting expres- 
sion measures for microarray probe sets, we did not 
compare PREBS values with these two methods in 
this case. 

Comparing PREBS vs microarray expression cor- 
relations of the two settings we see that the corre- 
lations for manufacturer's probe sets (Figure u\ are 
slightly lower than the correlations for genes using 
custom CDF files (Figure fl| and Isl). However, this is 
most likely due to the fact that there are many more 
probe sets than genes and the estimation of the corre- 
sponding individual expression levels is less reliable. 
Overall, PREBS provides a very reasonable level of 



3.6 Retrieval of similar RNA-seq— 
microarray experiments 

The higher similarity of RNA-seq and microarray 
data allows combining the two types of data in in- 
formation retrieval applications, where the aim is to 
retrieve similar experiments based on the content, 
i.e. the signature of expressed genes. This kind of 
joint modelling will significantly increase the utility 
of methods for content-based organisation of large 
gene expression databases such as [SJ. 

PREBS provides a much higher absolute gene ex- 
pression correlation with microarrays than the other 
methods, so it is likely that it is more suitable for 
similar RNA-scq-microarray experiment retrieval. In 
order to confirm that, we conducted an experiment 
where for several RNA-seq experiments a matching 
microarray experiment had to be retrieved from a 
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Figure 6: Average correlations of log2 fold changes estimated from RNA-seq-microarray sample pair com- 
pared to corresponding fold changes from microarray sample pairs as a function of number of most highly 
expressed genes considered. The values are averaged among all pairs of samples in the corresponding data 
sets, Marioni in (a) and LAML in (b). 



database. 

For this experiment we used the 183 microarray 
samples in the LAML data set, 17 of which had a 
matched RNA-seq pair. For each RNA-seq experi- 
ment we calculated gene expression estimate correla- 
tion with all microarray experiments. Accuracy was 
measured by how often the correct pair had the high- 
est correlation. Accuracy of retrieval was calculated 
for all three RNA-seq processing methods: PREBS, 
MMSEQ and read counting. 

To evaluate the performance of the methods us- 
ing different sized signatures, we evaluated the meth- 
ods performance with different numbers of top ex- 
pressed genes. As we can see in the results in Fig- 
ure |8j PREBS clearly outperforms the other meth- 
ods, especially when relatively small subsets of most 
highly expressed genes are used as signatures. Look- 
ing this another way, PREBS can provide similar ac- 
curacy with a signature that is 1/5 the size needed by 



other methods, which can provide significant compu- 
tational savings in modelling large databases. 

4 DISCUSSION 

Our results clearly demonstrate that the PREBS 
method is able to produce from RNA-seq data gene 
expression estimates that are significantly more sim- 
ilar to microarray estimates than standard process- 
ing pipelines. What is more, the probe set focused 
processing allows obtaining estimates for original mi- 
croarray probe sets, which is not possible with exist- 
ing methods. This will greatly aid in building inte- 
grated models of large gene expression databases that 
contain both microarray and RNA-sequencing data. 
PREBS greatly improves the comparability of ab- 
solute expression measures, but it does not provide 
a significant improvement for differential expression 
analysis. This may in part be explained by microar- 
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Figure 7: Estimated absolute expression values (a) and log2 fold changes values (b) for original microarray 
probe sets. The plots show 60% most highly expressed genes in the Marioni data set. 



ray probes that target the gene sequence subopti- 
mally, possibly focusing only on a small fraction of 
its alternatively spliced isoforms. This introduces a 
gene-specific bias to the expression estimates. When 
computing the difference between multiple samples, 
these biases tend to cancel. The good performance 
of PREBS suggests that this probe focusing is likely 
the most significant gene-specific bias in microarrays. 
Other biases related to for example different melting 
points and affinities of the probes depending on the 
sequence content exist, but they seem less important. 
Learning a model of these biases is an important av- 
enue of future work, but a detailed model will require 
a significant amount of paired RNA-seq-microarray 
data. 



5 CONCLUSION 

Different experimental techniques for measuring gene 
expression produce different results partly because 
they measure different things, such as different parts 



of the gene sequence. In this work we have pre- 
sented the PREBS method which aims to eliminate 
this difference from RNA-seq and microarray gene 
expression analyses by focusing the RNA-seq sum- 
marisation to microarray probe regions. Combining 
this with a standard microarray data processing algo- 
rithm leads to estimates of absolute expression that 
are significantly more similar to ones measured from 
the same samples using microarrays than standard 
RNA-seq processing techniques. The difference be- 
tween the methods is much smaller in differential 
expression, presumably because gene-specific biases 
cancel out in the differential analysis. 

Diminishing the differences between different gene 
expression measurement platforms paves the way for 
integrative modelling of large genomic data sets and 
big genome data applications. We have demonstrated 
that the PREBS approach can lead to increased accu- 
racy in a simplified content-based genomic informa- 
tion retrieval task. Extending this success to a real- 
istic integrative modelling system is a very attractive 
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Figure 8: Average precision of retrieving the corre- 
sponding microarray experiment from a large collec- 
tion based on correlation with expression estimates 
from RNA-seq as a function of the number of genes 
used as the signature. Accuracy is measured as a 
fraction of the samples which have the largest corre- 
lation with its true pair. 



avenue of future research. 
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